Hard Mixtures of Experts for Large Scale Weakly Supervised Vision
Training convolutional networks (CNNs) that fit on a single GPU with
minibatch stochastic gradient descent has become effective in practice.
However, there is still no effective method for training large CNNs that do
not fit in the memory of a few GPU cards, or for parallelizing CNN training. In
this work we show that a simple hard mixture of experts model can be
efficiently trained to good effect on large scale hashtag (multilabel)
prediction tasks. Mixture of experts models are not new (Jacobs et al., 1991;
Collobert et al., 2003), but in the past, researchers have had to devise
sophisticated methods to deal with data fragmentation. We show empirically that
modern weakly supervised data sets are large enough to support naive
partitioning schemes where each data point is assigned to a single expert.
Because the experts are independent, training them in parallel is easy, and
evaluation is cheap for the size of the model. Furthermore, we show that we can
use a single decoding layer for all the experts, allowing a unified feature
embedding space. We demonstrate that it is feasible (and in fact relatively
painless) to train far larger models than could be practically trained with
standard CNN architectures, and that the extra capacity can be well used on
current datasets.
Comment: Appearing in CVPR 201
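The naive partitioning idea, where each data point is assigned to exactly one expert and the experts are trained independently, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the nearest-centroid gating rule, the fixed centroids, the toy `MeanLabelExpert` stand-in (a real expert would be a CNN), and all function names.

```python
from collections import defaultdict


def nearest_centroid(x, centroids):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])),
    )


def partition(points, centroids):
    """Hard assignment: every point lands in exactly one expert's shard."""
    shards = defaultdict(list)
    for i, x in enumerate(points):
        shards[nearest_centroid(x, centroids)].append(i)
    return dict(shards)


class MeanLabelExpert:
    """Hypothetical stand-in expert: predicts the mean multilabel vector of its shard."""

    def fit(self, labels):
        n = len(labels)
        self.scores = [sum(col) / n for col in zip(*labels)]
        return self

    def predict(self, x):
        return self.scores


def train_experts(points, labels, centroids):
    """Each expert sees only its own shard, so this loop could run in parallel."""
    shards = partition(points, centroids)
    return {
        k: MeanLabelExpert().fit([labels[i] for i in idx])
        for k, idx in shards.items()
    }


def predict(x, centroids, experts):
    """Only the single selected expert is evaluated for a given input."""
    return experts[nearest_centroid(x, centroids)].predict(x)


# Toy data: two well-separated clusters with two possible labels (assumed).
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.9, 5.0)]
labels = [(1, 0), (1, 0), (0, 1), (0, 1)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
experts = train_experts(points, labels, centroids)
```

Because only one expert is evaluated per input, inference cost in this sketch stays close to that of a single expert even as the number of experts grows, which is the sense in which evaluation is cheap for the size of the model.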
Web-Scale Training for Face Identification
Scaling machine learning methods to very large datasets has attracted
considerable attention in recent years, thanks to easy access to ubiquitous
sensing and data from the web. We study face recognition and show that three
distinct properties have surprising effects on the transferability of deep
convolutional networks (CNNs). (1) The bottleneck of the network serves as an
important transfer learning regularizer. (2) Contrary to common wisdom,
performance saturation may exist in CNNs as the number of training samples
grows; we propose alleviating this by replacing the naive random subsampling
of the training set with a bootstrapping process. (3) We find a link between
the representation norm and the ability to discriminate in a target domain,
which sheds light on how such networks
represent faces. Based on these discoveries, we are able to improve face
recognition accuracy on the widely used LFW benchmark, both in the verification
(1:1) and identification (1:N) protocols, and directly compare, for the first
time, with the state-of-the-art Commercial-Off-The-Shelf system, showing a
sizable leap in performance.
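The difference between the verification (1:1) and identification (1:N) protocols mentioned above can be sketched as follows. This is a generic illustration and not the evaluation code behind the reported results: the cosine similarity measure, the threshold value, the toy two-dimensional embeddings, and the names `verify` and `identify` are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(a, b, threshold=0.5):
    """1:1 protocol: decide whether two embeddings show the same person.

    The threshold is an assumed value; in practice it would be tuned on held-out pairs.
    """
    return cosine(a, b) >= threshold


def identify(probe, gallery):
    """1:N protocol: return the gallery identity most similar to the probe."""
    return max(gallery, key=lambda name: cosine(probe, gallery[name]))


# Hypothetical gallery of enrolled identities with toy embeddings.
gallery = {"id_a": (1.0, 0.0), "id_b": (0.0, 1.0)}
```

Verification answers a yes/no question about a single pair, whereas identification ranks a probe against every enrolled identity, so its difficulty grows with the size of the gallery.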